ABSTRACT

A  global  state  of  a  distributed  transaction 
system is consistent if no transactions are in
progress.  A  global  checkpoint  is  a  transaction 
which  must  view  a  globally  consistent  system  state 
for  correct  operation.  We  present  an  algorithm 
for  adding  global  checkpoint  transactions  to  an 
arbitrary  distributed  transaction  system.  The 
algorithm is non-intrusive in the sense that
checkpoint  transactions  do  not  interfere  with 
ordinary  transactions  in  progress;  however,  the 
checkpoint  transactions  still  produce  meaningful 
results. 

1.  Introduction 

Computing  systems  operate  by  a  sequence  of 
internal  transitions  on  the  global  state  of  the 
system. The global state represents the collective
state of a set of objects which the system
controls.  Often  many  primitive  state  transitions 
are necessary to accomplish a larger semantically
meaningful task, called a transaction. Transactions
are designed to take the system from one
meaningful  or  consistent  state  to  another,  but 
during  the  execution  of  the  transaction,  the 
system may go through inconsistent intermediate
states. Thus, to ensure consistency of the
system  state,  every  transaction  must  either  be 
run  to  completion  or  not  run  at  all. 

Transactions are often the basis for concurrency
control. In a distributed database system,
a standard criterion for correctness of a system
is that all allowable interleavings of transactions
be "serializable" (cf. [1]). However, there
are systems which can run acceptably with unconstrained
interleavings. In a banking system, for
example,  a  transfer  transaction  might  consist  of 
a  withdrawal  step  followed  by  a  deposit  step.  In 
order  to  obtain  fast  performance,  the  withdrawals 
and  deposits  of  different  transfers  might  be 
allowed to interleave arbitrarily, even though the
users of the banking system are thereby presented
with a view of the account balances which includes
the  possibility  of  money  being  "in  transit"  from 
one  account  to  another. 

One useful kind of transaction is a "checkpoint":
a transaction that reads and returns all
the  current  values  for  the  objects  of  the  system. 

In  a  bank  database,  a  checkpoint  can  be  used  to 
audit  all  of  the  account  balances  (or  the  sum  of 
all  account  balances).  In  a  population  database,  a 
checkpoint  can  be  used  to  produce  a  census.  In  a 
general  transaction  system,  the  checkpoint  can  be 
used  for  failure  detection  and  recovery:  if  a 
checkpoint  produces  an  inconsistent  system  state, 
one  assumes  that  an  error  has  occurred  and  takes 
appropriate recovery measures.

For  a  checkpoint  transaction  to  return  a 
meaningful  result,  the  individual  read  steps  of 
the  checkpoint  must  not  be  permitted  to  interleave 
with  the  steps  of  the  other  transactions;  otherwise 
an  inconsistent  state  can  be  returned  even  for  a 
correctly  operating  system,  and  it  might  be  quite 
difficult to obtain useful information from such
intermediate results. For example, in a bank database
with transfer operations, an arbitrarily
interleaved audit might completely miss counting
some  money  in  transit  or  count  some  transferred 
money  twice,  thereby  arriving  at  an  incorrect  value 
for  the  sum  of  all  the  account  balances. 
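The miscount described above can be reproduced in a few lines. The following is a minimal sketch under stated assumptions, not part of the original presentation: the two-account bank, the function names, and the schedule strings (T for one transfer step, C for one audit read) are all illustrative.

```python
# Hypothetical two-account bank used only to illustrate interleaving.
def transfer_steps(accounts, src, dst, amount):
    """A transfer as two primitive steps: withdraw, then deposit."""
    accounts[src] -= amount
    yield "withdrawn"
    accounts[dst] += amount
    yield "deposited"


def audit_steps(accounts, order, seen):
    """An audit that reads one account per primitive step."""
    for name in order:
        seen[name] = accounts[name]
        yield name


def interleaved_audit(schedule):
    """Run one transfer of 30 from A to B against an audit of A then B.

    `schedule` is a string over {"T", "C"} giving the order of primitive
    steps. Returns the sum the audit reports; the true total is 200.
    """
    accounts = {"A": 100, "B": 100}
    seen = {}
    procs = {
        "T": transfer_steps(accounts, "A", "B", 30),
        "C": audit_steps(accounts, ["A", "B"], seen),
    }
    for who in schedule:
        next(procs[who])
    return sum(seen.values())


print(interleaved_audit("TCCT"))  # audit runs while the money is in transit
print(interleaved_audit("CTTC"))  # audit straddles the whole transfer
print(interleaved_audit("TTCC"))  # no interleaving
```

Under these assumptions the first schedule misses the 30 units in transit, the second counts them twice, and only the non-interleaved schedule reports the true total.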

A  checkpoint  which  is  not  allowed  to  interleave 
with  any  other  transactions  is  called  a  global 
checkpoint. In the bank database, a global checkpoint
would only see completed transfers; no money
would  be  overlooked  in  transit,  and  a  correct  sum 
would  be  obtained  for  all  account  balances.  In 
general,  a  global  checkpoint  views  a  globally 
consistent  state  of  the  system. 

In this paper, we present a method of implementing
global checkpoints in general distributed
transaction systems. We assume one starts with an
underlying distributed transaction system known to
be correct. Next we add some checkpoint transactions
C which are known to be correct if run when
no other transactions are running. Call the resulting
system S. Finally, we show how to transform
S into a new system S' which does the "same"
thing as S and which turns each of the transactions
in C into a global checkpoint, i.e. one
that always returns a view of a globally consistent
system state of the underlying transaction system.

Our introduction of the global checkpoints is
"nonintrusive"  in  the  sense  that  no  operations  of 
the  underlying  system  need  be  halted  while  the 
global  checkpoint  is  being  executed.  Because  of 
this,  it  is  not  always  possible  to  have  the  global 
checkpoint  view  a  consistent  state  in  the  recent 
history  of  the  underlying  transaction  system,  for 
that  system  might  enter  consistent  states  only 
infrequently  because  of  heavy  transaction  traffic. 
Thus,  instead  of  viewing  a  consistent  state  that 
actually  occurs,  our  global  checkpoints  view  a 
state  that  could  result  by  running  to  completion 
all  of  the  transactions  that  are  in  progress  when 
the  global  checkpoint  begins,  as  well  as  some  of 
the  transactions  that  are  initiated  during  its 
execution. 

2.  A  Model  for  Asynchronous  Parallel  Processes 

The  formal  model  used  to  state  the  correctness 
conditions  and  describe  the  algorithm  is  that  of 
[2].  Only  a  brief  description  is  provided  in  this 
paper;  the  reader  is  referred  to  [2]  for  a  complete, 
rigorous  treatment. 

The  basic  entities  of  the  model  are  processes 
(automata)  and  variables.  Processes  have  states 
(including  start  states  and  possibly  also  final 
states), while variables take on values. An atomic
execution  step  of  a  process  involves  accessing  one 
variable  and  possibly  changing  the  process'  state 
or  the  variable's  value  or  both.  A  system  of 
processes  is  a  set  of  processes,  with  certain  of 
its  variables  designated  as  internal  and  others  as 
external.  Internal  variables  are  to  be  used  only 
by  the  given  system.  External  variables  are 
assumed to be accessible to some "environment"
(e.g. other processes or users) which can change
the  values  between  steps  of  the  given  system. 

The  execution  of  a  system  of  processes  is 
described by a set of execution sequences. Each
sequence is a (finite or infinite) list of steps
which the system could perform when interleaved
with  appropriate  actions  by  the  environment.  Each 
sequence  is  obtained  by  interleaving  sequences  of 
steps  of  the  processes  of  the  system.  Each  process 
must  have  infinitely  many  steps  in  the  sequence 
unless  that  process  reaches  a  final  state. 

For  describing  the  external  behavior  of  a 
system,  certain  information  in  the  execution 
sequences  is  irrelevant.  The  external  behavior  of 
a system S of processes, extbeh(S), is the set
of  sequences  derived  from  the  execution  sequences 
by  "erasing"  information  about  process  identity, 
changes  of  process  state  and  accesses  to  internal 
variables.  What  remains  is  just  the  history  of 
accesses  to  external  variables.  This  history  takes 
the form of a sequence of variable actions, which
are triples of the form (u, X, v), where u is
the old value read from the variable X, and v is
the new value written by an atomic step of the
system. The external behavior completely characterizes
the system from the user's point of view;
two  systems  with  the  same  external  behavior  are 
completely  indistinguishable  to  the  user. 
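The "erasing" that defines the external behavior can be pictured as a simple projection over a finite execution sequence. The sketch below is only an illustration under assumed names: the `Step` record and the `external_behavior` function are hypothetical, and process-state changes are omitted from the record because they are erased anyway.

```python
from typing import Any, NamedTuple


class Step(NamedTuple):
    """One atomic step: a process accesses one variable (hypothetical record)."""
    process: str
    variable: str
    old: Any
    new: Any


def external_behavior(steps, external_vars):
    """Keep only accesses to external variables, as triples (u, X, v).

    Process identity and accesses to internal variables are dropped.
    """
    return [(s.old, s.variable, s.new)
            for s in steps if s.variable in external_vars]


trace = [
    Step("p1", "X", 0, 1),      # external access
    Step("p2", "tmp", 5, 6),    # internal access, erased
    Step("p1", "X", 1, 1),      # external read that leaves the value unchanged
]
print(external_behavior(trace, {"X"}))
```

Two traces that differ only in which process acted, or in their internal accesses, project to the same list, which is the sense in which the user cannot tell the systems apart.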


3.  An  Abstract  Distributed  Transaction  System 

In  a  database  system,  a  transaction  is  usually 
considered  to  be  a  sequence  of  operations  on  the 
database entities which should be performed according
to some concurrency control policy. For our
purposes, we do not need to look inside the transactions;
all that we require is that a particular
transaction  can  be  requested  at  any  time,  and  once 
requested, it will eventually run to completion.
What the transaction does while it is running and
how it interacts with other concurrent transactions
does  not  concern  us.  We  simply  assume  a  distributed 
system which understands the initiation and completion
of transactions at its external variables.

We  make  the  technical  restriction  that  each 
transaction  can  be  invoked  only  once;  thus,  our 
transactions  should  be  thought  of  as  instances  of 
the  usual  database  notion  of  transaction.  We  also 
assume  that  an  infinite  number  of  transactions  are 
possible,  although  only  a  finite  number  can  be 
running  at  any  given  time;  thus,  our  systems  never 
stop. 

Formally,  an  abstract  transaction  system  is  a 
distributed  system  whose  external  variables,  called 
ports,  have  a  special  interpretation.  Let  T  be  an 
infinite set of transactions. Each port can contain
a finite set of transaction status words, each of
which is a triple (t, a, s), where $t \in T$, a is
an arbitrary parameter or result value of the
transaction, and $s \in \{\text{'RUNNING'}, \text{'COMPLETE'}\}$ describes
the state of the transaction. We require
that  each  port  can  be  accessed  by  only  one  process, 
called  the  owner  of  the  port. 

The intended operation of the system is as
follows:  A  user  initiates  a  transaction  t  with 
argument  a  by  inserting  the  triple  (t,  a, 
'RUNNING')  into  the  set  of  transaction  status 
words  in  some  port.  Eventually,  the  system 
replaces  that  triple  by  a  new  triple  (t,  b, 
'COMPLETE')  in  the  same  port.  The  value  b  is 
the  result  of  the  transaction.  We  assume  the  user 
behaves correctly in not trying to initiate the
same transaction more than once and in not modifying
the transaction status word once a transaction
has  been  initiated.  Likewise,  a  correct  abstract 
transaction  system  never  changes  the  ports  or 
modifies  the  transaction  status  words  except  as 
described  above. 

Thus,  a  correct  abstract  transaction  system 
running  with  a  correct  user  maintains  a  global 
invariant that for each transaction $t \in T$, there
is  at  most  one  port  containing  a  transaction  status 
word  with  t  as  first  component,  and  there  is  at 
most  one  such  word  in  that  port.  We  call  that  word, 
if  it  exists,  the  status  word  for  t  ,  and  we  say 
t  is  running  or  completed  depending  on  the  third 
component  of  its  status  word.  We  say  t  is 
latent  if  it  has  no  status  word.  The  conditions 
above  imply  that  the  only  possible  transitions  in 
the  status  of  a  transaction  are  from  latent  to 
running  and  from  running  to  completed;  moreover, 
every  running  transaction  eventually  becomes 
completed.  Note  also  that  there  is  no  a  priori 
bound on the number of transactions that can be
running  simultaneously. 
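The life cycle of a status word (latent, then running, then completed) can be made concrete with a small model of a single port. This is a sketch under assumptions: the `Port` class and its method names are hypothetical, and the global invariant that a transaction appears in at most one port is not modelled here because only one port is shown.

```python
RUNNING, COMPLETE = "RUNNING", "COMPLETE"


class Port:
    """One port holding status words, keyed by transaction (illustrative)."""

    def __init__(self):
        self.words = {}  # transaction t -> (argument or result, state)

    def status(self, t):
        """Return 'LATENT' if t has no status word, else its state."""
        return self.words[t][1] if t in self.words else "LATENT"

    def initiate(self, t, a):
        """User action: insert (t, a, 'RUNNING'). Only latent -> running."""
        if t in self.words:
            raise ValueError("transaction may be initiated only once")
        self.words[t] = (a, RUNNING)

    def complete(self, t, b):
        """Owner action: replace the word by (t, b, 'COMPLETE')."""
        if self.status(t) != RUNNING:
            raise ValueError("only a running transaction can complete")
        self.words[t] = (b, COMPLETE)


port = Port()
port.initiate("t1", 5)
port.complete("t1", 25)
print(port.status("t1"), port.status("t2"))
```

Any attempt to re-initiate a transaction, or to complete one that is not running, raises an error, mirroring the only two transitions the text allows.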

4.  Checkpoint  Transactions 

Let $C \subseteq T$ be a distinguished set of transactions
called checkpoints. Members of $T - C$ are
called ordinary transactions. The execution of a
checkpoint transaction and the result it returns
are valid if no other transaction is running while
the checkpoint is. While we make no restrictions
on what a checkpoint does, the intuition is that
a checkpoint needs to look at a globally consistent
system state in order to work properly, that is, a
state of the system that occurs when no transactions
are in progress. For example, the checkpoint might
be an audit of the account balances in a simple
banking system, or it might be a consistency check
in a file system. These two examples are pursued
further in Section 6.

Our goal in this paper is, given a transaction
system S with checkpoints, to construct a new
transaction system S' which does the "same" thing
as S for non-checkpoint transactions and which
returns a valid result for each checkpoint transaction.
A straightforward implementation of S'
would simply suspend further processing of transactions
when a checkpoint is requested, wait for any
transactions  currently  in  progress  to  complete,  and 
then  run  the  checkpoint.  After  the  checkpoint  has 
been  completed,  normal  processing  of  transactions 
can  be  resumed. 

In  many  practical  situations,  however,  such 
a  solution  is  highly  undesirable,  for  the  entire 
system must wait while a checkpoint is being performed.
This is likely to take a considerable
length  of  time  since  checkpoints  may  require 
reading  the  entire  system  state. 

In  Section  5,  we  present  a  solution  which 
permits  checkpoints  to  be  run  concurrently  with 
the normal processing of ordinary transactions.
The price we pay is in having a slightly less
appealing correctness condition for the result of
the checkpoint. Since normal transactions are not
suspended, the system may never reach a globally
consistent state, so it is not obvious how a
meaningful result can be obtained at all. Our
approach is to run the checkpoint on a globally
consistent  state  obtained  by  the  following  steps: 

1. Disable the initiation of further transactions
at  each  of  the  ports. 

2.  Run  to  completion  any  transactions  in  progress. 

To do this and still not interfere with the processing
of normal transactions requires that we split
the computation into two parallel branches. One
branch continues to simulate S on the ordinary
transactions; the other branch handles the execution
of the checkpoint as described above. When
the checkpoint is complete, the result is stored
back  in  the  appropriate  transaction  status  word 
and  the  branch  discarded. 
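The two-branch idea can be sketched in a centralized setting. The example below is an assumption-laden illustration, not the distributed construction of this paper: the `Bank` class and `checkpoint_audit` function are hypothetical, and `copy.deepcopy` stands in for the splitting of the computation, which in a distributed system cannot be done as one atomic copy.

```python
import copy


class Bank:
    """Toy system whose transfers are withdraw-then-deposit (illustrative)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.in_transit = []  # (dst, amount): withdrawn but not yet deposited

    def withdraw(self, src, dst, amount):
        """First step of a transfer; the transfer is now in progress."""
        self.balances[src] -= amount
        self.in_transit.append((dst, amount))

    def deposit_next(self):
        """Second step of the oldest in-progress transfer."""
        dst, amount = self.in_transit.pop(0)
        self.balances[dst] += amount


def checkpoint_audit(bank):
    """Audit on a separate branch; the base system is left untouched."""
    branch = copy.deepcopy(bank)          # split off the checkpoint branch
    # Step 1: the branch accepts no new transfers (withdraw is never called on it).
    # Step 2: run to completion every transfer already in progress.
    while branch.in_transit:
        branch.deposit_next()
    return sum(branch.balances.values())  # the branch is discarded on return


bank = Bank({"A": 100, "B": 100})
bank.withdraw("A", "B", 30)               # a transfer is in progress
print(sum(bank.balances.values()), checkpoint_audit(bank))
```

The base bank still shows the transfer in transit after the audit returns, so the audit sees a state that could have arisen rather than one that actually did.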

Consequences  of  this  strategy  are: 


(a)  The  value  returned  by  a  checkpoint  does  not 
reflect what actually happened in the history
of  execution;  only  what  might  have  happened  if 
certain  transactions  initiated  after  the  start 
of  the  checkpoint  had  not  occurred. 

(b)  Any  side-effects  of  checkpoint  transactions 
are  discarded,  so  other  transactions  continue 
to operate as if no checkpoints had ever taken
place. 

With this motivation in mind, we now turn to
the definitions needed to state the formal correctness
conditions for the system S'.

Let X be a port and u, v be sets of
transaction status words. We call the variable
action (u, X, v) a port action. Let PA be the
set of all port actions. A behavior sequence is a
finite or infinite sequence of port actions, i.e.
a member of $B = PA^{*} \cup PA^{\omega}$.

Let h be a map which erases checkpoint
status words from port values, that is, if u is
a set of transaction status words, then

$$h(u) = \{(t,a,s) \mid (t,a,s) \in u \text{ and } t \in T - C\}.$$

Extend h to port actions by

$$h((u,X,v)) = (h(u), X, h(v)).$$

Extend h further to $B$ by applying it componentwise.
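For finite behavior sequences the map h is a straightforward filter. The following sketch assumes a representation that is not in the original: status words are Python tuples `(t, a, s)`, port values are `frozenset`s of such tuples, and the function names `h_port`, `h_action`, and `h_seq` are hypothetical.

```python
def h_port(u, checkpoints):
    """Erase the status words of checkpoint transactions from a port value."""
    return frozenset(w for w in u if w[0] not in checkpoints)


def h_action(action, checkpoints):
    """Apply h to both the old and the new value of a port action (u, X, v)."""
    u, X, v = action
    return (h_port(u, checkpoints), X, h_port(v, checkpoints))


def h_seq(e, checkpoints):
    """Apply h componentwise to a finite behavior sequence."""
    return [h_action(a, checkpoints) for a in e]


C = {"c1"}
u = frozenset({("t1", 5, "RUNNING"), ("c1", None, "RUNNING")})
v = frozenset({("t1", 9, "COMPLETE"), ("c1", None, "RUNNING")})
print(h_seq([(u, "X", v)], C))
```

Applying `h_seq` twice gives the same result as applying it once, which is why a sequence e with h(e) = e can be read as "e contains no checkpoint transactions".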

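Let $e \in B$. Define two functions:

$$\text{running}(e) = \{\, t \in T \mid e = e_1 \cdot (u,X,v) \cdot e_2 \text{ and } (t,a,\text{'RUNNING'}) \in u \text{ for some transaction status word } (t,a,\text{'RUNNING'}),\ e_1 \in PA^{*},\ (u,X,v) \in PA,\ \text{and } e_2 \in B \,\}.$$

$$\text{completed}(e) = \{\, t \in T \mid e = e_1 \cdot (u,X,v) \cdot e_2 \text{ and } (t,b,\text{'COMPLETE'}) \in v \text{ for some transaction status word } (t,b,\text{'COMPLETE'}),\ e_1 \in PA^{*},\ (u,X,v) \in PA,\ \text{and } e_2 \in B \,\}.$$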

Thus, running(e) is the set of transactions which
are running at some time during e, and completed(e)
is the set of transactions which have
completed in e.

Let $e \in B$, $t \in \text{running}(e)$, and $i \in N$. We say t
starts at step i of e if i is the length of
the longest prefix $e_1$ of e for which
$t \notin \text{running}(e_1)$.
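On finite behavior sequences these three definitions translate directly into code. The sketch below is illustrative only: it assumes port actions are tuples `(u, X, v)` whose port values are sets of `(t, a, s)` tuples, it uses the state names 'RUNNING' and 'COMPLETE' from Section 3, and the function name `starts_at` is hypothetical.

```python
def running(e):
    """Transactions seen with state 'RUNNING' in the old value of some action."""
    return {t for (u, _X, _v) in e for (t, _a, s) in u if s == "RUNNING"}


def completed(e):
    """Transactions written with state 'COMPLETE' in the new value of some action."""
    return {t for (_u, _X, v) in e for (t, _b, s) in v if s == "COMPLETE"}


def starts_at(e, t):
    """Length of the longest prefix e1 of e with t not in running(e1)."""
    if t not in running(e):
        raise ValueError("t must be in running(e)")
    for i in range(len(e)):
        if t in running(e[: i + 1]):
            return i  # the prefix of length i is the last one without t


a0 = (frozenset(), "X", frozenset())
a1 = (frozenset({("t1", 5, "RUNNING")}), "X",
      frozenset({("t1", 25, "COMPLETE")}))
e = [a0, a1]
print(running(e), completed(e), starts_at(e, "t1"))
```

Here the system first reads an empty port and then, in its second action, observes the request for t1 and completes it, so t1 starts at step 1.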

An  abstract  distributed  transaction  system  S' 
is a faithful implementation of a system S with
checkpoint  set  C  if  the  following  conditions  hold. 

1. (Faithfulness). Let $e \in \text{extbeh}(S)$ such that
$h(e) = e$ (i.e. e contains no checkpoint
transactions). Let $\sigma : C \to N$ be a partial
function with domain $\text{dom}(\sigma)$. Then there
exists $e' \in \text{extbeh}(S')$ such that $h(e') = e$,
$\text{running}(e') = \text{running}(e) \cup \text{dom}(\sigma)$, and for all
$t \in \text{dom}(\sigma)$, t starts at step $\sigma(t)$ in e'.

2. (Safety). Let $e' \in \text{extbeh}(S')$. Then
$h(e') \in \text{extbeh}(S)$.

3. (Validity of checkpoints). Let $e' \in \text{extbeh}(S')$,
$c \in C$, and suppose c runs to completion in e'
and produces result b. Let i be the step at
which c starts in e', and let $e_1'$ be the
prefix of e' of length i. Let $e_2'$ be the
shortest word such that $e_1' e_2'$ is a prefix of
e' and $c \in \text{completed}(e_1' e_2')$. Then there
exists $e \in \text{extbeh}(S)$ such that c runs to
completion in e and produces result b, and
e satisfies the following. There exist words
$e_1$, $e_2$, $f$ such that $e_1 e_2 f$ is a prefix of e and

(i) $h(e_1 e_2) = e_1 e_2$;

(ii) $h(e_1') = e_1$;

(iii) $\text{running}(e_1 e_2) \subseteq \text{running}(e_1' e_2')$;

(iv) $c \in \text{completed}(e_1 e_2 f)$ and
$\{c\} = \text{running}(e_1 e_2 f) - \text{completed}(e_1 e_2)$.

Conditions (1) and (2) ensure that S'
faithfully simulates S on the non-checkpoint
transactions and that the presence or absence of
checkpoint transactions does not affect the processing
of other transactions by S'. Condition (3)
ensures that S' computes acceptable results for
the checkpoint transactions. In particular, the
result of each checkpoint must be a value obtainable
by some computation of S which (i) runs no
checkpoints before the given checkpoint, (ii)
agrees with the computation of S' up to the
point where the checkpoint began (again ignoring
other checkpoints), (iii) only initiates transactions
thereafter which actually occurred in S',
and (iv) runs the checkpoint after all the transactions
in progress at the time of the checkpoint
request together with any transactions initiated
after the checkpoint have completed, thereby
ensuring a valid result.

5.  A  Faithful  Implementation 

Given  an  abstract  distributed  transaction 
system S with checkpoint set C, we sketch how
to construct a new system S' which faithfully
implements S.

S' operates by simulating a number of copies
of S: a "base" copy $S_0$ and a copy $S_c$ for
each $c \in C$. $S_0$ processes all of the non-checkpoint
transaction requests received by S',
and $S_c$ processes checkpoint transaction c.

$S_0$ ignores checkpoints but otherwise acts
just like S. $S_c$, $c \in C$, does exactly the same
thing as $S_0$ up until checkpoint c is requested.
At that time, the computation of $S_c$ begins to
diverge from that of $S_0$. $S_c$ continues behaving
like S, but it starts ignoring certain new transactions
that are being processed by $S_0$. Eventually,
it ceases processing new transactions
entirely, and all the transactions currently in
progress are run to completion. At that time, $S_c$
runs checkpoint transaction c, and when it completes,
$S_c$ writes the result back into the transaction
status word at the initiating port. $S_c$ has
then completed its task and can terminate.

The structure of S' is similar to that of
S. Each process and variable of S has a corresponding
process or variable in S'. Process k
of S' simulates process k in each of the $S_i$,
$i \in \{0\} \cup C$. Similarly, internal variable X of
S' simulates internal variable X in each of the
$S_i$. The states of processes in S' are labelled
sets of states of corresponding processes of S,
and values of variables in S' are labelled sets
of values of corresponding variables of S, where
the labels are taken from $\{0\} \cup C$. S and S'
have identical ports and port values.

We now describe in some detail the operation
of the processes in $S_c$. Each process does
exactly the same thing as the corresponding process
of $S_0$ until it learns that checkpoint c has
been requested. There are three ways that a process
might learn this. It might access its port and see
the transaction status word for c. In this case,
that process is called the checkpoint initiator.
Secondly, it might receive a "message" from the
checkpoint initiator informing it of the start of
the checkpoint. Finally, it might read an internal
variable and detect that the computation of $S_c$
has begun to diverge from that of $S_0$, enabling it
to deduce that the checkpoint has started.

When the checkpoint initiator discovers the
start of the checkpoint, it broadcasts this fact
to the other processes of $S_c$. Each process of
$S_c$, upon learning of the initiation of the checkpoint,
makes a private copy of its port and thereafter
refers to its private copy rather than the
real port. In this way, future transaction
requests are ignored by $S_c$, and results of
transactions produced by $S_c$ (which might differ
from those produced by $S_0$) do not affect the real
ports. When a process of $S_c$ finally discovers
that all of the transactions at its port have completed,
it sends back an acknowledgement to the
checkpoint initiator. When the initiator has
received an acknowledgement from each process
(including itself), it begins processing the checkpoint
by placing the checkpoint request in its own
private copy of its port. All of the processes of
$S_c$ continue operating and serve collectively to
process the checkpoint c. When c completes,
the checkpoint initiator copies the final transaction
status word for c from the private copy of
its port back into the real port.
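The broadcast, private-copy, and acknowledgement phases can be walked through in a small sequential simulation. Everything here is a simplifying assumption for illustration: the `Proc` class and `run_checkpoint` function are hypothetical, message passing is replaced by direct method calls, ports map a transaction to its state only, and "processing" an ordinary transaction or the checkpoint is reduced to flipping its state.

```python
RUNNING, COMPLETE = "RUNNING", "COMPLETE"


class Proc:
    """One process of the checkpoint copy, with a real port and a private copy."""

    def __init__(self, name, port):
        self.name = name
        self.port = dict(port)   # real port: transaction -> state
        self.private = None      # private copy, made on learning of the checkpoint
        self.acked = False

    def learn_of_checkpoint(self, c):
        """Freeze a private copy of the port, leaving out the request for c."""
        if self.private is None:
            self.private = {t: s for t, s in self.port.items() if t != c}

    def step(self):
        """Finish one running transaction in the private copy, if any.

        Returns True exactly once: when the acknowledgement is sent.
        """
        for t, state in self.private.items():
            if state == RUNNING:
                self.private[t] = COMPLETE
                return False
        if not self.acked:
            self.acked = True
            return True
        return False


def run_checkpoint(procs, initiator, c):
    for p in procs:                       # broadcast from the initiator
        p.learn_of_checkpoint(c)
    acks = set()
    while len(acks) < len(procs):         # collect one acknowledgement per process
        for p in procs:
            if p.step():
                acks.add(p.name)
    initiator.private[c] = RUNNING        # place the request in the private copy
    initiator.private[c] = COMPLETE       # stand-in for actually processing c
    initiator.port[c] = initiator.private[c]  # copy the final word to the real port
    return acks


p1 = Proc("p1", {"c": RUNNING, "t1": RUNNING})
p2 = Proc("p2", {"t2": RUNNING, "t3": COMPLETE})
print(sorted(run_checkpoint([p1, p2], p1, "c")), p1.port)
```

In this sketch the only change to any real port is the final status word for c; the ordinary transactions t1 and t2 complete only in the private copies, which is the separation the text relies on.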

The correctness conditions of Section 4 are
quite strong and do not permit S' to make any
accesses to the ports other than those made by $S_0$.
Therefore, the simulation of the $S_c$, $c \in C$, must
be coordinated with that of $S_0$ so that all real
port accesses by $S_c$ are "piggybacked" onto port
accesses by $S_0$. The basic strategy is that $S_0$
runs freely, but a process of $S_c$ wishing to
access the real port must wait until the corresponding
process of $S_0$ is ready to make its next
port access. The two (or more) accesses are then
combined into one and performed simultaneously.
The accesses never conflict because each process of
$S_c$ does the exact same thing as the corresponding
process of $S_0$ up until the point where it discovers
the start of the checkpoint. Thereafter, it
only modifies the status word for c, whereas
processes of $S_0$ only modify status words for
ordinary transactions.

At any point in the computation, only a finite
set D of checkpoints have ever been initiated, so
the computation of every $S_c$, $c \in C - D$, is identical
to the computation of $S_0$ and need be represented
only once. As soon as a process of S'
discovers that checkpoint c is in progress,
either by being the checkpoint initiator, receiving
a message from the checkpoint initiator, or by
reading an internal variable in which the c-th
component differs from the 0-th, it splits the
simulation of $S_c$ from that of $S_0$ and from then
on, the two simulations continue independently, as
described above. Hence, S' actually simulates
a finite but growing set of computations.
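One way to picture a labelled set whose components are represented only once until they split is a dictionary with a default entry. This is a hedged sketch of that representation, not a data structure given in the paper: the class name `Labelled` and its methods are hypothetical, label 0 stands for the base copy, and any label without its own entry is taken to share the base value.

```python
class Labelled:
    """Value of one variable of S': explicit entries only for split copies."""

    def __init__(self, base):
        self.vals = {0: base}     # label 0 is the base copy

    def get(self, label):
        """Value seen by copy `label`; unsplit copies share the base value."""
        return self.vals.get(label, self.vals[0])

    def split(self, label):
        """Give `label` its own entry, initially equal to the base value."""
        self.vals.setdefault(label, self.vals[0])

    def set(self, label, value):
        """Write on behalf of copy `label` (0 for the base copy)."""
        self.vals[label] = value

    def diverged(self, label):
        """True if `label` has its own entry and it differs from the base."""
        return label in self.vals and self.vals[label] != self.vals[0]


x = Labelled(10)
x.set(0, 11)            # before any split, every copy follows the base
x.split("c")            # checkpoint c discovered: keep a separate component
x.set(0, 12)            # the base copy moves on
print(x.get("c"), x.get(0), x.get("d"), x.diverged("c"))
```

A reader that finds `diverged("c")` true is in the third situation described above: it can deduce that checkpoint c has started. The number of stored entries grows only with the checkpoints actually split, matching the "finite but growing" remark.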

In  order  to  carry  out  the  above  implementation, 
S'  needs  a  mechanism  which  permits  the  checkpoint 
initiator to communicate with every other process.
In any particular application, such a communication
mechanism would probably already exist in the
underlying  system  S  .  However,  if  it  isn't 
already  there,  then  we  require  that  S'  be 
augmented  with  such  a  facility. 

Theorem.  Let  S  be  an  abstract  distributed 
transaction  system  with  checkpoint  set  C  ,  and 
let  S'  be  the  system  described  above.  Then  S' 
is  a  faithful  implementation  of  S  . 

Proof Sketch. We omit the tedious but
straightforward verification that S' satisfies
the conditions for being a faithful implementation
of S. It remains to verify, however, that S' is
a correct abstract distributed transaction system,
that is, that every transaction which is requested
will eventually run to completion.

This property holds for non-checkpoint
transactions by the safety property and the fact
that it holds for S. It holds for checkpoint
transactions because each of the phases in processing
a checkpoint terminates. Eventually a request
for checkpoint c gets noticed by the process of
$S_c$ which owns the port; otherwise, $S_c$ and hence
S would fail to process future transactions originating
at that port. After the checkpoint request
is noticed, the checkpoint initiator notifies all
other processes of $S_c$; hence, eventually all of
the other processes learn of the request. After
each process becomes aware of the checkpoint, it
stops accepting requests for new transactions;
hence, eventually $S_c$ stops processing new
transactions. $S_c$ continues to simulate S on the
transactions that it has accepted; they all eventually
complete since they would in S. Each
process eventually acknowledges completion to the
initiator, so eventually the checkpoint transaction
itself is started. $S_c$ continues to simulate
S, so eventually the checkpoint transaction will
complete and produce a valid result, which is
copied back into the port.

Hence, S' is an abstract distributed transaction
system which faithfully implements S, as
required. □

We remark that under certain naturally
occurring conditions, the efficiency of S' can
be made to approach the efficiency of S. Namely,
assume that all checkpoint transactions originate
at the same port. Then it is an easy matter to
modify the checkpoint initiator so that only one
checkpoint is handled at a time. If several are
requested simultaneously, the initiator will pick
one to process and wait until it completes before
handling another. Since only one checkpoint c
is running at a time, each process of S' need
only simulate two processes: the corresponding
process of $S_0$ and the corresponding process of
$S_c$. When a process becomes aware of the request
of some checkpoint d, $d \neq c$, then it knows that
checkpoint c must have completed; hence it
terminates the simulation of $S_c$. Thus, the
storage needed by S' for the internal variables
and process states is only double that of S.
(In practice, one would probably only keep duplicate
copies of those objects for which the two
executions $S_0$, $S_c$ really produce different
values.) Likewise, the time required by S',
when appropriately measured, should be at worst
double that of S on the particular computations
actually simulated.


6.  Applications  of  Global  Checkpoints 

Global checkpoints can play an important role
in the design of distributed systems for error
detection, error recovery, or both.

For error detection, their use is in identifying
inconsistencies in global system states that
should be consistent. We have already alluded to
this use in the simple banking system example in
which the only allowable transactions are to transfer
funds from one account to another. The sum of
the account balances is the same in every globally
consistent state. Therefore, our algorithm can be
used to obtain that sum by running a global checkpoint
transaction which simply reads each of the
account balances and adds them all up. An error
is indicated if this sum is not what was expected.

A  similar  situation  occurs  in  the  design  of 
file systems. Often a directory must be kept consistent
with the actual contents of a disk. A global
checkpoint  might  read  the  items  in  the  directory 
and  check  that  they  correspond  with  what  is  really 
on  the  disk.  As  long  as  no  directory  modification 
transactions  were  in  progress  when  the  checking 
was done, then a discrepancy would indicate a true
file  system  error.  Our  global  checkpoint  algorithm 
can  be  used  to  detect  such  inconsistencies. 

For  error  recovery,  global  checkpoints  can  be 
used  to  save  the  relevant  part  of  the  global  state 
of  the  system  so  that  in  the  event  of  a  crash,  the 
system  can  later  be  restarted  from  that  point  in 
the  computation.  For  example,  a  global  checkpoint 
could  be  used  to  provide  a  restart  capability  in 
the  migrating  transaction  model  of  Rosenkrantz, 
et al. [4] by having it return the values of all
of  the  entities  in  the  database. 

Another  such  application  arises  in  the  use  of 
the  Eden  system  which  is  being  developed  at  the 
University  of  Washington  [3].  That  system  is 
object-based and includes as a primitive kernel
operation  a  checkpoint  operation  that  writes  a 
single  object  to  stable  storage.  The  object  itself 
decides  when  it  is  in  a  consistent  state  and  hence 
when  the  checkpoint  can  be  performed.  If  the 
object  later  crashes,  it  is  restored  from  the 
version on stable storage. To extend this checkpoint
facility to groups of related and cooperating
Eden  objects  requires  that  the  objects  coordinate 
their  checkpointing  activities  so  that  the  versions 
saved  on  stable  storage  are  globally  and  not  just 
locally  consistent.  That  is  just  the  problem  we 
have been treating in this paper if we take
"transaction" to mean the portion of computation
during which an individual Eden object is in an inconsistent
state, and if we assume further that an
object  only  enters  an  inconsistent  state  in 
response  to  some  external  stimulus  (corresponding 
to  a  transaction  request) .  Our  global  checkpoint 
algorithm  could  then  be  applied  to  produce  a 
globally  consistent  system  state  on  stable  storage 
by  running  the  "global  checkpoint"  transaction 
which  simply  checkpoints  each  of  the  objects  in 
the  group.  Note  that  our  algorithm  requires  the 
independence  of  transactions.  If  one  transaction 
can initiate another and then wait for its
completion, then completion of the first depends
on completion of the second, and our algorithm,
which might decide to exclude the second from a
checkpoint, would wait forever for the first transaction
to complete. Our formal definition of a
transaction system excludes the possibility of
system-initiated transactions.